Abstract

Many successful applications of computer vision to im- 
age or video manipulation are interactive by nature. How- 
ever, parameters of such systems are often trained neglect- 
ing the user. Traditionally, interactive systems have been 
treated in the same manner as their fully automatic counter- 
parts. Their performance is evaluated by computing the ac- 
curacy of their solutions under some fixed set of user inter- 
actions. This paper proposes a new evaluation and learning 
method which brings the user in the loop. It is based on the 
use of an active robot user - a simulated model of a human 
user. We show how this approach can be used to evaluate 
and learn parameters of state-of-the-art interactive segmen- 
tation systems. We also show how simulated user models 
can be integrated into the popular max-margin method for 
parameter learning and propose an algorithm to solve the 
resulting optimisation problem. 



1. Introduction 

Problems in computer vision are known to be extremely hard, and very few fully automatic vision systems have been shown to be accurate and robust under all sorts of challenging inputs. In the past, this confined most vision algorithms to the laboratory environment. The last decade, however, has seen computer vision finally come out of the research lab and into the real-world consumer market. This sea change has occurred primarily on the back of the development of a number of interactive systems which allow users to help the vision algorithm achieve the correct solution by giving hints. Some successful examples are systems for image and video manipulation, and interactive 3D reconstruction tasks. Image stitching and interactive image segmentation are two of the most popular applications in this area. Understandably, interest in interactive vision systems has grown in the last few years, which has led to a number of workshops and special sessions in vision, graphics, and user-interface conferences.¹

The performance of an interactive system strongly de- 
pends on a number of factors, one of the most crucial being 
the user. This user dependence makes interactive systems 
quite different from their fully automatic counterparts, es- 
pecially when it comes to learning and evaluation. Surpris- 
ingly, there has been little work in computer vision or ma- 
chine learning devoted to learning interactive systems. This 
paper tries to bridge this gap. 

We choose interactive image segmentation to demon- 
strate the efficacy of the ideas presented in the paper. How- 
ever, the theory is general and can be used in the context 
of any interactive system. Interactive segmentation aims to 
separate a part of the image (an object of interest) from the 
rest. It is treated as a classification problem where each 
pixel can be assigned one of two labels: foreground (fg) or 
background (bg). The interaction comes in the form of sets of pixels which the user marks with brushes as belonging either to fg or to bg. We will refer to each user interaction in this scenario as a brush stroke.

This work addresses two questions: (1) How to evaluate 
any given interactive segmentation system? and (2) How 
to learn the best interactive segmentation system? Observe 
that the answer to the first question gives us an answer to 
the second. One may imagine a learning algorithm gener- 
ating a number of possible segmentation systems. This can 
be done, for instance, by changing parameter values of the 
segmentation algorithm. We can then evaluate all such sys- 
tems, and output the best one. 

We demonstrate the efficacy of our evaluation methods 
by learning the parameters of the state-of-the-art system for 
interactive image segmentation and its variants. We then go 
further, and show how the max-margin method for learning 
parameters of fully automated structured prediction models 

1 e.g. ICV07 and NIPS09.



can be extended to do learning with the user in the loop. 
To summarize, the contributions of this paper are: (1) The 
study of the problems of evaluating and learning interac- 
tive systems. (2) The proposal and use of a user model for 
evaluating and learning interactive systems. (3) The first 
thorough comparison of state-of-the-art segmentation algorithms under an explicit user model. (4) A new algorithm for max-margin learning with the user in the loop.

Organization of the paper. In Section 2 we discuss the problem of system evaluation. In Section 3 we give details of our problem setting and explain the segmentation systems we use for our evaluation. Section 4 explains the naive line-search method for learning segmentation system parameters. In Section 5 we show how the max-margin framework for structured prediction can be extended to handle interactions, and show some basic results. The conclusions are given in Section 6.

2. Evaluating Interactive Systems 

Performance evaluation is one of the most important 
problems in the development of real world systems. There 
are two choices to be made: (1) The data sets on which 
the system will be tested, and (2) the quality or accuracy 
measure. Traditional computer vision and machine learn- 
ing systems are evaluated on preselected training and test 
data sets. For instance, in automatic object recognition, one minimizes the number of misclassified pixels on datasets such as PASCAL VOC.

In an interactive system, these choices are much harder 
to make because of the presence of an active user in the 
loop. Users behave differently, prefer different interactions, 
may have different error tolerances, and may also learn over 
time. The true objective function of an interactive system - 
although intuitive - is hard to express analytically: The user 
wants to achieve a satisfying result easily and quickly. We 
will now discuss a number of possible solutions, some of which are well known in the literature.

2.1. Static User Interactions 

This is one of the most commonly used methods in papers on interactive image segmentation [4, 18]. It uses a fixed set of user-made interactions (brush strokes) associated with each image of the dataset. These strokes are mostly chosen by the researchers themselves and are encoded using image trimaps. These are pixel assignments with foreground, background, and unknown labels (see Figure 2b). The system to be evaluated is given these trimaps as input, and its accuracy is measured by computing the Hamming distance between the obtained result and the ground truth. This scheme of evaluation does not consider how users may change their interaction by observing the current segmentation results. Evaluation and learning methods which work with a fixed set of interactions will be referred to as static in the rest of the paper.

Although the static evaluation method is easy to use, it 
suffers from a number of problems: (1) The fixed interac- 
tions might be very different from the ones made by actual 
users of the system. (2) Different systems prefer different types of user hints (interaction strokes), and thus a fixed set of hints might not be a good way of comparing two competing segmentation systems. For instance, geodesic distance based approaches [3, 9, 18] prefer brush strokes which are equidistant from the segmentation boundary, as opposed to graph cut based approaches [5, 16]. (3) The evaluation
does not take into account how the accuracy of the results 
improves with more user strokes. For instance, one sys- 
tem might only need a single user interaction to reach the 
ground truth result, while the other might need many inter- 
actions to get the same result. Still, both systems will have 
equal performance under this scheme. These problems of static evaluation make it a poor tool for judging the relative performance of newly proposed segmentation systems.

2.2. User Studies 

A user study involves the system being given to a group 
of participants who are required to use it to solve a set of 
tasks. The system which is easiest to use and yields the 
correct segmentation in the least amount of time is considered the best. Examples are [13], where a full user study was carried out, and [3], where an advanced user did the best possible job with each system on a few images.

While user studies overcome most of the problems of static evaluation, they introduce new ones: (1) User studies are
expensive and need a large number of participants to be of 
statistical significance. (2) Participants need to be given 
enough time to familiarize themselves with the system. For 
instance, an average driver steering a Formula 1 car for the 
first time, might be no faster than with a normal car. How- 
ever, after gaining experience with the car, one would ex- 
pect the driver to be much faster. (3) Each system has to be 
evaluated independently by participants, which makes it in- 
feasible to use this scheme in a learning scenario where we 
are trying to find the optimal parameters of the segmenta- 
tion system among thousands or millions of possible ones. 

2.3. Evaluation using Crowdsourcing 

Crowdsourcing has attracted a lot of interest in the machine learning and computer vision communities. This is primarily due to the success of a number of money [19], reputation [24], and community [17] based incentive schemes for collecting training data from users on the web. Crowdsourcing has the potential to be an excellent platform for evaluating interactive vision systems such as those for image segmentation. One could imagine asking Mechanical Turk [1] users to cut out different objects in images with different systems. The one requiring the least number of interactions on average might be considered the best. However, this approach, too, suffers from a number of problems, such as the need for fraud prevention. Furthermore, as with user studies, it cannot be used for learning, where thousands or even millions of systems have to be compared.

Table 1: Comparison of methods for interactive learning.

2.4. Evaluation with an Active User Model 

In this paper we propose a new evaluation methodology 
which overcomes most of the problems described above. In- 
stead of using a fixed set of interactions, or an army of hu- 
man participants, our method only needs a model of user in- 
teractions. This model is a simple algorithm which - given 
the current segmentation, and the ground truth - outputs the 
next user interaction. This user model can be coded up us- 
ing simple rules, such as "give a hint in the middle of the 
largest wrongly labelled region in the current solution", or 
alternatively, can be learnt directly from the interaction logs 
obtained from interactive systems deployed in the market. 
There are many similarities between the problem of learn- 
ing a user model and the learning of an agent policy in re- 
inforcement learning. Thus, one may exploit reinforcement 
learning methods for this task. The pros and cons of the evaluation schemes are summarized in Table 1.
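The evaluation protocol with an active user model can be sketched as a short loop. The following is a minimal Python sketch under stated assumptions: `segment` and `user_model` are hypothetical callables standing in for a segmentation system and a robot user, labellings are flat lists of 0/1, and the toy system and toy user at the bottom exist only to make the loop runnable. None of these names come from the paper.

```python
def hamming_error_percent(seg, gt):
    """Percentage of pixels whose label differs from the ground truth."""
    wrong = sum(1 for s, g in zip(seg, gt) if s != g)
    return 100.0 * wrong / len(gt)


def evaluate_with_user_model(segment, user_model, image, gt, num_strokes):
    """Simulate `num_strokes` interactions and record the error after each.

    segment(image, strokes) -> labelling       (hypothetical system interface)
    user_model(gt, seg)     -> stroke or None  (hypothetical robot user)
    """
    strokes = []
    errors = []
    seg = segment(image, strokes)
    for _ in range(num_strokes):
        stroke = user_model(gt, seg)
        if stroke is not None:  # None means nothing is left to correct
            strokes.append(stroke)
            seg = segment(image, strokes)
        errors.append(hamming_error_percent(seg, gt))
    return errors


# Toy stand-ins: the "system" predicts background everywhere except where
# the user has clamped a pixel; the "user" fixes the first wrong pixel.
def toy_segment(image, strokes):
    seg = [0] * len(image)
    for index, label in strokes:
        seg[index] = label
    return seg


def toy_user(gt, seg):
    for i, (s, g) in enumerate(zip(seg, gt)):
        if s != g:
            return (i, g)
    return None
```

Because the user model only sees the ground truth and the current result, the same loop can be run unchanged for any system that can be called as a black box.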

3. Image Segmentation: Problem Setting 
3.1. The Database 

We use the publicly available GrabCut database of 50 images, in which ground truth segmentations are known [2]. In order to perform large scale testing and comparison, we down-scaled all images to have a maximum size of 241 × 161, while keeping the original aspect ratio.² For each image, we created two different static user inputs: (1) A "static trimap" computed by dilating and eroding the ground truth segmentation by 7 pixels.³ (2) A "static brush" consisting of a few user-made brush strokes which very roughly indicate foreground and background. We used on average about 4 strokes per image. (The magenta and cyan strokes in Fig. 2c give an example.) All this data is visualized in



2 We confirmed by visual inspection that the quality of segmentation 
results is not affected by this down-scaling operation. 

3 This kind of input is used by most systems, both for comparison with competitors and for learning of parameters, e.g. [4, 18].

Figure 1: As detailed in the paper, we took the 50 GrabCut images [2] with given ground truth segmentations (coded as black/white). We considered two kinds of user inputs (coded as red/blue): user-defined strokes and tight trimaps generated by eroding the ground truth segmentation. The user strokes were drawn by only looking at the ground truth segmentation $y^k$ and ignoring the image $x^k$.



Figure 1. Note that in Sec. 3.3 we will describe a third, "dynamic trimap" input, called the robot user, where we simulate the user.

3.2. The Segmentation Systems 

We now describe the 4 different interactive segmentation systems we use in the paper. These are: "GrabCutSimple" (GCS), "GrabCut" (GC), "GrabCutAdvanced" (GCA), and "GeodesicDistance" (GEO).

GEO is a very simple system. We first learn Gaussian 
Mixture Model (GMM) based color models for fg/bg from 
user made brush strokes. We then simply compute the short- 
est path in the likelihood ratio image as described in [3] to 
get a segmentation. 

The other three systems are all built on graph cut. They all work by minimizing the energy function:

$$E(y) = \sum_{p \in \mathcal{V}} E_p(y_p) + \sum_{(p,q) \in \mathcal{E}} E_{pq}(y_p, y_q) \qquad (1)$$

Here $(\mathcal{V}, \mathcal{E})$ is an undirected graph whose nodes correspond to pixels, and $y_p \in \{0, 1\}$ is the segmentation label of image pixel $p$ with color $x_p$, where 0 and 1 correspond to the background and the foreground respectively. We define $(\mathcal{V}, \mathcal{E})$ to be an 8-connected 2D grid graph.

The unary terms are computed as follows. A probabilistic model is learnt for the colors of background ($y_p = 0$) and foreground ($y_p = 1$) using two different GMMs $\Pr(x|0)$ and $\Pr(x|1)$. $E_p(y_p)$ is then computed as $-\log(\Pr(x_p|y_p))$, where $x_p$ contains the three color channels of pixel $p$. An important concept of GrabCut [16] is to update the color models based on the whole segmentation. In practice we use a few iterations.

The pairwise term incorporates both an Ising prior and a contrast-dependent component and is computed as

$$E_{pq}(y_p, y_q) = \frac{|y_q - y_p|}{\mathrm{dist}(p, q)} \left( w_i + w_c \exp\left[ -\beta \|x_p - x_q\|^2 \right] \right)$$

where $w_i$ and $w_c$ are weights for the Ising and contrast-dependent pairwise terms respectively, and $\beta = 0.5 \cdot w_\beta / \langle \|x_p - x_q\|^2 \rangle$ is a parameter, where $\langle \cdot \rangle$ denotes expectation over an image sample [16]. We can scale $\beta$ with the parameter $w_\beta$.
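To make the role of the three weights concrete, the pairwise term can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes the term has the form (|y_q - y_p| / dist(p, q)) · (w_i + w_c · exp(-β ‖x_p - x_q‖²)) with β = 0.5 · w_β / ⟨‖x_p - x_q‖²⟩, and the function names and the representation of colors as tuples are assumptions, not part of the paper.

```python
import math


def estimate_beta(neighbour_pairs, w_beta=1.0):
    """beta = 0.5 * w_beta / <||x_p - x_q||^2>, averaged over neighbour pairs.

    Assumes the image is not perfectly flat (mean squared difference > 0).
    """
    sq = [sum((a - b) ** 2 for a, b in zip(xp, xq)) for xp, xq in neighbour_pairs]
    mean_sq = sum(sq) / len(sq)
    return 0.5 * w_beta / mean_sq


def pairwise_energy(y_p, y_q, x_p, x_q, dist, w_i, w_c, beta):
    """Ising plus contrast-dependent cost; zero when the labels agree."""
    if y_p == y_q:
        return 0.0
    sq = sum((a - b) ** 2 for a, b in zip(x_p, x_q))
    return (w_i + w_c * math.exp(-beta * sq)) / dist
```

The cost of cutting between two neighbours is highest where their colors are similar and decays towards the Ising weight `w_i` across strong edges; a larger `w_beta` makes this decay faster.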

To summarize, the models have two linear free parameters, $w_i$ and $w_c$, and a single non-linear one, $w_\beta$. The system GC minimizes the energy defined above and is very close to the original GrabCut system [16]. GrabCutSimple (GCS) is a simplified version in which the color models (and unary terms) are fixed up front; they are learnt from the initial user brush strokes only (see Sec. 3.1). GCS will be used in max-margin learning and to check the active user model, but it is not considered a practical system.

Finally, "GrabCutAdvanced" (GCA) is an advanced GrabCut system performing considerably better than GC. Inspired by recent work [14], foreground regions are required to be 4-connected to a user-made brush stroke to avoid deserted foreground islands. Unfortunately, such a notion of connectivity leads to an NP-hard problem, and various solutions have been suggested [15, 23]. However, all these are either very slow and operate on super-pixels [15] or have a very different interaction mechanism [23]. We simply remove deserted foreground islands in a postprocessing step.

3.3. The Robot User 

We now describe the different active user models tested and deployed by us. Given the ground truth segmentation $y^k$ and the current segmentation solution $y$, the active user model is a policy $s : (x^k, y^k, u^{k,t}, y) \mapsto u^{k,t+1}$ which specifies which brush stroke to place next. Here, $u^{k,t}$ denotes the user interaction history of image $x^k$ up to time $t$. We have investigated various options for this policy: (1) Brush strokes at random image positions. (2) Brush strokes in the middle of the largest wrongly labelled region (centre). For the second strategy, we find the largest connected region of the binary mask given by the absolute difference between the current segmentation and the ground truth. We then mark a circular brush stroke at the pixel which is inside this region and furthest away from its boundary. This is motivated by the observation that users tend to find it hard to mark pixels at the boundary of an object because they have to be very precise.
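The "centre" strategy described above can be sketched in plain Python. This is a sketch under assumptions that the text does not fix: regions are taken to be 4-connected, the distance to the region boundary is measured in city-block steps, and the stroke is returned as a single centre pixel plus its ground truth label (the brush radius is left out). The function names are hypothetical.

```python
from collections import deque

NEIGHBOURS_4 = ((1, 0), (-1, 0), (0, 1), (0, -1))


def largest_error_region(seg, gt):
    """Return the pixels of the largest 4-connected wrongly labelled region."""
    h, w = len(gt), len(gt[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for r in range(h):
        for c in range(w):
            if seen[r][c] or seg[r][c] == gt[r][c]:
                continue
            region, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                i, j = queue.popleft()
                region.append((i, j))
                for di, dj in NEIGHBOURS_4:
                    ni, nj = i + di, j + dj
                    if (0 <= ni < h and 0 <= nj < w and not seen[ni][nj]
                            and seg[ni][nj] != gt[ni][nj]):
                        seen[ni][nj] = True
                        queue.append((ni, nj))
            if len(region) > len(best):
                best = region
    return best


def centre_stroke(seg, gt):
    """Pick the pixel of the largest error region furthest from its boundary.

    Returns ((row, col), label), or None if the segmentation is already correct.
    """
    region = largest_error_region(seg, gt)
    if not region:
        return None
    inside = set(region)
    # Multi-source BFS from the pixels that touch the outside of the region
    # gives each pixel its city-block distance to the region boundary.
    dist = {}
    queue = deque()
    for (i, j) in region:
        if any((i + di, j + dj) not in inside for di, dj in NEIGHBOURS_4):
            dist[(i, j)] = 0
            queue.append((i, j))
    while queue:
        i, j = queue.popleft()
        for di, dj in NEIGHBOURS_4:
            n = (i + di, j + dj)
            if n in inside and n not in dist:
                dist[n] = dist[(i, j)] + 1
                queue.append(n)
    centre = max(region, key=lambda p: dist[p])
    return centre, gt[centre[0]][centre[1]]
```

Note that the policy never queries the segmentation system itself, which is what makes it cheap enough to run inside a learning loop.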

We also tested user models which take the segmentation algorithm explicitly into account. This is analogous to users who have learnt how the segmentation algorithm works and interact with it accordingly. We consider user models which mark a circular brush stroke at the pixel (1) with the lowest min-marginal (sensit), (2) which results in the largest change in labeling (roi size), or (3) which decreases the Hamming error by the biggest amount (Hamming). For the last model, we consider each pixel as the circle center and choose the one where the Hamming error decreases most. This is very expensive, but in some respects is the best solution.⁴ "Hamming" acts as a very "advanced user", who knows exactly which interactions (brush strokes) will reduce the error by the largest amount. It is questionable whether a real user is actually able to find the optimal position, and a user study might be needed to settle this. On the other hand, the "centre" user model behaves as a "novice user".



Fig. 2c shows the result of a robot user interaction, where cyan and magenta are the initial fixed brush strokes (called "static brush trimap"), and the red and blue dots are the robot user interactions. The robot sets brushes of a fixed maximum size (here 4 pixel radius). Away from the true object boundary, the maximum brush size is used. At the boundary, the brush size is scaled down so that the brush does not straddle the boundary.

Fig. 2d shows the performance of the 5 different user models (robot users) over a range of 20 brushes. Here we used the GCS system, since it is computationally infeasible to apply the (sensit; roi; Hamming) user models to the other interactive systems. GCS can be used because it allows efficient computation of solutions by recycling computation during the optimization [11]. In the other systems this is not possible, since the unaries change with every brush stroke, and hence we have to treat the system as a black box.

As expected, the random user performs badly. Interestingly, the robot users minimizing the energy (roi, sensit) also perform badly. Both "Hamming" and "centre" are considerably better than the rest. It is interesting to note that "centre" is only marginally worse than "Hamming". For other systems this conclusion might not hold; GEO, for example, is much more sensitive to the location of the brush stroke than a system based on graph cut, as [18] has shown. To summarize, "centre" is the robot user strategy which simulates a "novice user" and is computationally feasible, since it does not look at the underlying system at all. Also, for GCS "centre" performed nearly as well as the optimal strategy "Hamming". Hence, for the



4 Note, one could do even better by looking at two or more consecutive brushes and then selecting the optimal one. However, the cost of this search grows exponentially with the number of look-ahead steps.

Figure 2: An image from the database (a), tight trimap (b), robot user (red/blue) started from user scribbles (magenta/cyan)
with segmentation (black) after B=20 strokes (c) and segmentation performance comparison of different robot users (d). 



rest of the paper we always use the "centre" user model, which from here onwards we call our robot user.

3.4. The Error Measure 

For a static trimap input there are many different ways of obtaining an error rate, see e.g. [4, 10]. In a static setting, most papers use the number of misclassified pixels (Hamming distance) between the ground truth segmentation and the current result. We call this measure "$er_b$", i.e. the Hamming error for brush $b$. One could do variations,
e.g. [ 1 1 weight distances to the boundary differently, but 
we have not investigated this here. Fig. 2d shows how the Hamming error behaves with each interaction.

For learning and evaluation we need an error metric giving us a single score for the whole interaction. One choice is the "weighted" Hamming error averaged over a fixed number of brush strokes $B$. In particular, we choose the error "Er" as $Er = [\sum_b f(er_b)]/B$, where $f(er_b) = er_b$. Note that, to ensure a fair comparison between systems, $B$ must be the same number for all systems. Another choice for the quality metric, which matches more closely what the user wants, is described as follows. We use a sigmoid function $f : \mathbb{R}_+ \to [0, c]$ of the form:

$$f(er_b) = \begin{cases} 0 & er_b < 1.5 \\ c - \dfrac{c}{(er_b - 0.5)^2} & er_b \ge 1.5 \end{cases}, \qquad c = 5 \qquad (2)$$



Observe that $f$ encodes two facts: all errors below 1.5 are considered negligible, and large errors never weigh more than $c$. The reason for the first setting is that visual inspection showed that, for most images, an error below 1.5% corresponds to a visually pleasing result. Of course this is highly subjective, e.g. a missing limb from the segmentation of a cow might be an error of 0.5% but is visually unpleasing, whereas an incorrectly segmented low-contrast area may have an error of 2% but is visually not disturbing. The reason for having a maximum weight of $c$ is that users do not discriminate between two systems giving large errors. Thus errors of 50% and 55% are essentially equally penalized.
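The two error weightings can be compared with a small Python sketch. It rests on an assumption about the exact shape of the saturating weight: f is taken to be 0 below 1.5 and c - c/(er_b - 0.5)² from 1.5 upwards, with c = 5, which is continuous at 1.5 and approaches c for large errors. The function names are hypothetical.

```python
def sigmoid_weight(er_b, c=5.0):
    """Assumed form of f: 0 below 1.5, c - c / (er_b - 0.5)^2 from 1.5 upwards."""
    if er_b < 1.5:
        return 0.0
    return c - c / (er_b - 0.5) ** 2


def interaction_error(errors, f=sigmoid_weight):
    """Er = (sum_b f(er_b)) / B for the per-stroke Hamming errors er_b (in %)."""
    return sum(f(e) for e in errors) / len(errors)
```

Under this assumed form, errors of 50% and 55% both receive a weight within 0.01 of c, whereas the identity weighting `f=lambda e: e` keeps them 5 points apart.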

Due to runtime limitations for parameter learning, we do not want to run the robot user for too many brushes (e.g. a maximum of 20 brushes). Thus we start by giving an initial set of brush strokes which are used to learn the colour models. At the same time, we want most images to reach an error level of about 1.5%. When we start with a static brush trimap, we obtain an error rate smaller than 1.5% for 68% of the images and smaller than 2.5% for 98% of the images with the GCA system. We also confirmed that the initial static brush trimap does not affect the learning considerably.⁵

4. Interactive Learning by line-search 

Systems with few parameters can be trained by simple line (or grid) search. Our systems, GC and GCA, have only 3 free parameters: $w_c$, $w_i$, $w_\beta$. Line search is done by fixing all but one free parameter and simulating the user interaction process for 30 different discrete values $w_j$ of the free parameter $w$ over a predefined range. The optimal value $w^*$ from the discrete set is chosen to minimize the leave-one-out (LOO) estimator of the test error.⁶ Not only do we prevent overfitting, but we can also efficiently compute the Jackknife estimator of the variance [25, ch. 8.5.1]. This procedure is done for all parameters in sequence, with a sensible starting point for all parameters. We do one sweep only. One important thing to notice is that our dataset was big enough (and our parameter set small enough) not to suffer from over-fitting. We see this by observing that training and test error rates are virtually the same for all experiments. In addition to the optimal value, we obtain the variance for setting this parameter. Roughly speaking, this variance tells us how important it is to have this particular value. For instance, a high variance means that parameters different from the selected one would also perform well.
Note that, since our error function (Eq. 2) is defined for both static and dynamic trimaps, the above procedure can be performed for all three types of trimaps: "static trimap", "static brush", "dynamic brush".

Table 2 summarizes all the results, and Fig. 3 illustrates some results during training and test (the caption explains the details of the plots). One can observe that the three different trimaps suggest different optimal parameters for each system, and are differently certain about them. This leads to the key contribution of this study: a system which is interactive at test time also has to be trained in an interactive way. We see from the test plots that, indeed, a system trained with the "dynamic trimap" does better than one trained with either the "static brush" or the "static trimap".

5 We started the learning from no initial brushes and let it run for 60 brush strokes. The learned parameters were similar to those obtained when starting from 20 brushes.

6 This is number-of-data-point-fold cross validation.

(a) GCA, $w_c$ training (b) GCA, $w_\beta$ training (c) GC, $w_c$ training (d) GCA, ($w_c$, $w_\beta$) test

Figure 3: Line search. We compare 3 different training procedures for interactive segmentation: static learning from a fixed set of user brushes, static learning from a tight trimap, and dynamic learning with a robot user starting from a fixed set of user brushes. Train (a-c): Reweighted Hamming errors (± stdev.) for two segmentation systems (GC/GCA) as a function of two line-search parameters ($w_c$, $w_\beta$). The optimal parameter is shown along with its Jackknife variance estimate (black horizontal bar). Test (d): Segmentation performance using the optimal parameters ($w_c^*$, $w_\beta^*$) after iterated line search optimisation. Note that the dynamically learnt parameters develop their strength in the course of interaction.

Let us look closer at some learnt settings. For system GCA and parameter $w_c$ (see Table 2, first row, and Fig. 3a), we observe that the optimal value in a dynamic setting is lower (0.03) than in any of the static settings. This is surprising, since one would have guessed that the true value of $w_c$ lies somewhere in between those for a loose and a very tight trimap. Interestingly, in [18] the authors learned a parameter by averaging the performance from two static trimaps. From the above study, one might have concluded that the static "tight trimap" gives good insights about the choice of parameters. However, when we now consider the training of the parameter $w_\beta$ in the GCA system, we see that such a conclusion would be wrong, since the "tight trimap" reaches a very different minimum (9.73) than the dynamic trimap (2.21).⁷ To summarize, conclusions about the optimal parameter setting of an interactive system should be drawn from a large set of interactions and cannot be made by looking solely at a few (here two) static trimaps.

Table 2: System GCA. Optimal values ± stdev.



For the sake of completeness, we give the same numbers for the GC system in Table 3. We reach the same conclusions as above. One interesting thing to notice here is that the pairwise terms (esp. $w_c$) are chosen higher than in GCA. This is expected, since without post-processing a lot of isolated islands may be present which are far away from the true boundary. In GCA, post-processing automatically removes these islands, so the pairwise terms can concentrate on modeling the smoothness on the boundary correctly. In GC, however, the pairwise terms additionally have to make sure that the isolated regions are removed (by choosing a higher value for the pairwise terms) in order to compensate for the missing post-processing step.

7 Note, the fact that the uncertainty of the "tight trimap" learning is high gives an indication that this value cannot be trusted very much.



Table 3: System GC. Optimal values ± stdev. 

It is interesting to note that for the error metric $f(er_b) = er_b$, we get slightly different values, see Table 4. For instance, we see that $w_c = 0.07 \pm 0.07$ for GCA with our active user. This is not too surprising, since it says that larger errors are more important (this is what $f(er_b) = er_b$ does). Hence, it is better to choose a larger value of $w_c$.

In Figure 3 of the paper we plot the actual segmentation error $f(er_b)$ and not the error measure $\sum_{b=1}^{B} f(er_b)$ for $f(er_b) = \mathrm{sigmoid}(er_b)$. In Table 4 we have collected all final error measure values. It is visible from the table that the dynamically adjusted parameters only perform better in terms of the instantaneous error but not in terms of the cumulative error measure.

In order to get a complete picture, we provide the full set of plots for the line search experiments. We report results for the two systems GCA and GC on three parameters $w_c$, $w_i$ and $w_\beta$ and two error weighting functions $f(er_b)$ in Figures 6 and 7.

Table 4: System GCA. Optimal values ± stdev.

Novice vs Advanced User. When comparing different interactive systems, we have to decide whether the system is designed for an advanced or a novice user.

In a user study, one has full control over selecting ad- 
vanced or novice users. This can be done by changing the 
amount of introduction given to the participants. However, 
this process is lengthy and therefore infeasible for learning. 

In our robot user paradigm, we can simulate users with 
different levels of experience. We run the (center) user model to simulate a novice user and evaluate four different systems. The results are shown in Fig. 4. The order of the methods is as expected: GCA is best, followed by GC, then GCS, and GEO. GEO performs badly since, unlike the other systems, it does no smoothing at the boundary.

[Plot: "center user from brush"; x-axis: B, number of strokes.]

Figure 4: System comparison: Segmentation performance comparison between 4 different systems: GCA, GC, GCS and GEO using the robot user started from initial user brushes.



5. Interactive Max-margin Learning 

The grid-search method used in Section 4 can only be used for learning models with few parameters. Max-margin methods deal with models containing large numbers of parameters and have been used extensively in computer vision. However, they work with static training data and cannot be used with an active user model. In this section, we show how the traditional max-margin parameter learning algorithm can be extended to incorporate an active user.



5.1. Static SVMstruct 

Our exposition builds heavily on [20 1 and the refer- 
ences therein. The SVMstruct framework [22] allows to 
adjust linear parameters w of the segmentation energy 
E Vf (y, x) (Eq. [Hi from a given training set {x fe , y k } k =i..K 
of K images 6 R n and ground truth segmentations]^] 
y 6 y := {0, 1}" by balancing between empirical risk 
J2k A(y fc j /(x fc )) and regularisation by means of a trade- 
off parameter C. A (symmetric) loss function^] A : y x 
y — > R + measures the degree of fit between two segmen- 
tations y and y*. The current segmentation is given by 
y* = argmin y _E w (y,x). We can write the energy func- 
tion as an inner product between feature functions ipi (y, x) 
and our parameter vector w: E vr (y,x) = w T i/>(y,x). 
With the two shortcuts Sipy — t/>(x fc ,y) — ■0(x fc ,y fc ) and 
£ k = A(y,y k ), the margin rescaled objective ETTl reads 



mm 

£>0,w 

sb.t. 



o(w) 



K S 



(3) 



i yey\y k 



. {w T r5^-4} > 



Vfc. 



In fact, the convex function $o(\mathbf{w})$ can be rewritten as a sum
of a quadratic regulariser and a maximum over an exponentially
sized set of linear functions, each corresponding
to a particular segmentation $\mathbf{y}$. Which energy functions
fit under the umbrella of SVMstruct? In principle, in the
cutting-planes approach [22] to solve Eq. [3], we only require
efficient and exact computation of
$\arg\min_{\mathbf{y}} E_{\mathbf{w}}(\mathbf{y})$ and
$\arg\min_{\mathbf{y} \neq \mathbf{y}^k} E_{\mathbf{w}}(\mathbf{y}) - \Delta(\mathbf{y}, \mathbf{y}^k)$.
For the scale of images, i.e. $n > 10^5$, submodular energies of the form
$E_{\mathbf{w}}(\mathbf{y}) = \mathbf{y}^\top \mathbf{F} \mathbf{y} + \mathbf{b}^\top \mathbf{y}$,
$F_{ij} \ge 0$, $b_i \in \mathbb{R}$ allow for efficient
minimisation by graph cuts. As soon as we include connectivity
constraints as in Eq. [1], we can only approximately
train the SVMstruct. However, some theoretical properties
seem to carry over empirically [8].
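The second argmin above (loss-augmented inference) stays as tractable as plain inference when the loss decomposes over pixels, as the Hamming loss does: subtracting the loss simply lowers the unary cost of every label that disagrees with the ground truth. The Python sketch below illustrates this on a tiny chain. The energy layout, the function names and the brute-force minimiser (a stand-in for a graph-cut solver) are assumptions made for illustration, and the exclusion of the ground truth labeling itself is omitted for brevity.

```python
from itertools import product

def energy(y, unary, pairwise_weight):
    """Chain energy: unary costs plus an Ising penalty per label change."""
    data = sum(unary[i][yi] for i, yi in enumerate(y))
    smooth = sum(pairwise_weight for a, b in zip(y, y[1:]) if a != b)
    return data + smooth

def loss_augmented_unary(unary, y_true):
    """Fold -Hamming(y, y_true) into the unaries: every label that
    disagrees with the ground truth becomes cheaper by 1."""
    return [[cost - (1 if label != t else 0) for label, cost in enumerate(costs)]
            for costs, t in zip(unary, y_true)]

def argmin_labeling(unary, pairwise_weight):
    """Brute-force stand-in for graph cuts; only viable for tiny n."""
    n = len(unary)
    return min(product((0, 1), repeat=n),
               key=lambda y: energy(y, unary, pairwise_weight))
```

Calling `argmin_labeling(loss_augmented_unary(unary, y_true), weight)` then returns the most violated labeling that a cutting-planes step would add as a constraint.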

5.2. Dynamic SVMstruct with "Cheating" 

The SVMstruct does not capture the user interaction part.
Therefore, we add a third term to the objective that measures
the amount of user interaction $\iota$, where $\mathbf{u}^k \in \{0, 1\}^n$
is a binary image indicating whether the user provided the
label of the corresponding pixel or not. One can think of $\mathbf{u}^k$
as a partial solution fed into the system by the user brush
strokes. In a sense, $\mathbf{u}^k$ implements a mechanism for the
SVMstruct to cheat, because only the unlabeled pixels have
to be segmented by our $\arg\min_{\mathbf{y}} E_{\mathbf{w}}$ procedure, whereas
the labeled pixels stay clamped. In the optimisation problem,
we also have to modify the constraints such that
only segmentations $\mathbf{y}$ compatible with the interaction $\mathbf{u}^k$



8 We write images of size $(n_x, n_y, n_c)$ as vectors for simplicity. All
involved operations respect the 2D grid structure absent in general $n$-vectors.
9 We use the Hamming loss $\Delta_H(\mathbf{y}^*, \mathbf{y}^k) = \mathbf{1}^\top |\mathbf{y}^k - \mathbf{y}^*|$.



are taken into account. Our modified objective is given by: 

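$$\min_{\boldsymbol{\xi} \ge 0,\, \mathbf{w},\, U} \; o(\mathbf{w}, U) := \frac{1}{2}\|\mathbf{w}\|^2 + \frac{C}{K}\mathbf{1}^\top \boldsymbol{\xi} + \iota \qquad (4)$$

$$\text{s.t.} \quad \min_{\mathbf{y} \in \mathcal{Y}|_{\mathbf{u}^k} \setminus \mathbf{y}^k} \left\{ \mathbf{w}^\top \delta\boldsymbol{\psi}^k_{\mathbf{y}} - \ell^k_{\mathbf{y}} \right\} \ge -\xi_k \quad \forall k, \qquad \iota \ge \mathbf{a}^\top \mathbf{u}^k \quad \forall k.$$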

For simplicity, we choose the amount of user interaction or
cheating $\iota$ to be the maximal $\mathbf{a}$-reweighted number of
labeled pixels $\iota = \max_k \mathbf{a}^\top \mathbf{u}^k$, with uniform weights
$\mathbf{a} = a \cdot \mathbf{1}$. Other formulations based on the average rather
than on the maximal amount of interaction proved feasible
but less convenient. We denote the set of all user interactions
for all $K$ images $\mathbf{x}^k$ by $U = [\mathbf{u}^1, .., \mathbf{u}^K]$. The
compatible label set $\mathcal{Y}|_{\mathbf{u}^k} \subseteq \{0, 1\}^n$ is given by
$\{\hat{\mathbf{y}} \in \mathcal{Y} \mid u^k_i = 1 \Rightarrow \hat{y}_i = y^k_i\}$,
where $\mathbf{y}^k$ is the ground truth labeling. Note that
$o(\mathbf{w}, U)$ is convex in $\mathbf{w}$ for all values of $U$ and efficiently
minimisable by the cutting-planes algorithm. However, the
dependence on $\mathbf{u}^k$ is horribly difficult: we basically have
to find the smallest set of brush strokes leading to a correct
segmentation. Geometrically, setting one $u^k_i = 1$ halves the
number of possible labellings and therefore removes half of
the label constraints. The optimisation problem (Eq. [5]) can
be re-interpreted in two different ways:
Firstly, we can define a modified energy
$\tilde{E}_{\mathbf{w},U}(\mathbf{y}) = E_{\mathbf{w}}(\mathbf{y}) + \sum_{i \in \mathcal{V}} u^k_i \, \phi_i(y_i, y^k_i)$
with additional cheating potentials $\phi_i(y_i, y^k_i) := \infty$ for
$y_i \neq y^k_i$ and $0$ otherwise, allowing us to treat the
SVMstruct with cheating as an ordinary SVMstruct with modified
energy function $\tilde{E}_{\mathbf{w},U}(\mathbf{y})$ and extended
weight vector $\tilde{\mathbf{w}} = [\mathbf{w}; \mathbf{u}^1; ..; \mathbf{u}^K]$.
A second (but closely related) interpretation starts from the
fact that the true label $\mathbf{y}^k$ can be regarded as a feature vector
of the image $\mathbf{x}^k$.^10 Choosing $U$ therefore amounts to feature
selection in a very particular feature space. There is a direct link to
multiple kernel learning, a special kind of feature selection.
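The first interpretation can be illustrated with a short Python sketch: an infinite cheating potential on every user-labeled pixel restricts the minimiser to the compatible label set, and the interaction cost is the largest weighted stroke count over all images. The function names, the callable `base_energy` and the brute-force minimiser are hypothetical stand-ins chosen for illustration, not the implementation used in the experiments.

```python
from itertools import product
import math

def cheating_energy(y, base_energy, u, y_true):
    """E_w(y) plus an infinite potential wherever a user-labeled pixel
    (u_i = 1) disagrees with the ground truth label."""
    for yi, ui, ti in zip(y, u, y_true):
        if ui == 1 and yi != ti:
            return math.inf
    return base_energy(y)

def clamped_argmin(base_energy, u, y_true):
    """Brute-force minimiser over all labelings; labeled pixels end up
    clamped because every incompatible labeling has infinite energy."""
    n = len(u)
    return min(product((0, 1), repeat=n),
               key=lambda y: cheating_energy(y, base_energy, u, y_true))

def interaction_cost(U, a=1.0):
    """iota = max_k a * (number of labeled pixels in u^k), uniform weights."""
    return max(a * sum(u) for u in U)
```

For example, with a base energy that prefers the all-zero labeling, labeling a single foreground pixel forces only that pixel to 1 while the remaining pixels are still chosen by the energy.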

5.3. Optimisation with strategies 

We explored two approaches to minimise $o(\mathbf{w}, U)$.
Based on the discrete derivative $\partial o / \partial U$, we tried coordinate
descent schemes. Due to the strong coupling of the variables,
only very short steps were possible.^11 Conceptually,
the process of optimisation is decoupled from the user
interaction process, where removal of already known labels
from the cheating does not make sense. At every
stage of interaction, a user acts according to a strategy
$s : (\mathbf{x}^k, \mathbf{y}^k, \mathbf{u}^{k,t}, \mathbf{y}, \mathbf{w}) \mapsto \mathbf{u}^{k,t+1}$.
The notion of strategy or policy is also at the core of a robot user. In order to
capture the sequential nature of the human interaction and 



10 In fact, it is probably the most informative feature one can think of.
The corresponding predictor is given by the identity function.

11 In the end, we can only safely flip a single pixel $u^k_i$ at a time to
guarantee descent.



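assuming a fixed strategy $s$, we relax Eq. [4] to

$$\min_{\boldsymbol{\xi} \ge 0,\, \mathbf{w}} \; o(\mathbf{w}, T) := \frac{1}{2}\|\mathbf{w}\|^2 + \frac{C}{K}\mathbf{1}^\top \boldsymbol{\xi} + \iota \qquad (5)$$

$$\text{s.t.} \quad \min_{\mathbf{y} \in \mathcal{Y}|_{\mathbf{u}^{k,T}} \setminus \mathbf{y}^k} \left\{ \mathbf{w}^\top \delta\boldsymbol{\psi}^k_{\mathbf{y}} - \ell^k_{\mathbf{y}} \right\} \ge -\xi_k \quad \forall k, \qquad \iota \ge \mathbf{a}^\top \mathbf{u}^{k,T}, \quad \mathbf{u}^{k,T} = s^T(\mathbf{x}^k, \mathbf{y}^k, \mathbf{w}) \quad \forall k$$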

where we denote repeated application of the strategy $s$ by
$s^T(\mathbf{x}^k, \mathbf{y}^k, \mathbf{w}) = \bigcirc_{t=0}^{T-1} s(\mathbf{x}^k, \mathbf{y}^k, \mathbf{u}^{k,t}, \mathbf{w})$
and by $\bigcirc$ the function concatenation operator. Note that we still cannot
properly optimise Eq. [3]. However, as a proxy, we develop
Eq. [5] forward by starting at $t = 0$ with $\mathbf{u}^{k,0}$. In every
step $t$, we interleave the optimisation of the convex objective
$o(\mathbf{w}^t, t)$ and the inclusion of a new user stroke, yielding
$\mathbf{w}^T$ as the final parameter estimate.
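The interleaving described above can be sketched as a simple loop in Python. This is a minimal illustration under strong assumptions: `toy_fit` is a one-parameter threshold search standing in for the convex SVMstruct solve, `toy_segment` is a thresholding rule standing in for the clamped energy minimisation, and `robot_user` is a hypothetical strategy that labels the first wrongly segmented pixel. None of these are the components used in the experiments; only the alternation between fitting and adding one stroke per image mirrors the text.

```python
def toy_segment(x, w, u, y_true):
    """Threshold segmentation; user-labeled pixels stay clamped to the truth."""
    return [t if ui else int(xi > w) for xi, ui, t in zip(x, u, y_true)]

def toy_fit(data, U, grid=(0.25, 0.5, 0.75)):
    """Stand-in for the convex solver: pick the threshold with the fewest
    errors (clamped pixels are always correct, so only unlabeled ones count)."""
    def errors(w):
        return sum(
            1
            for (x, y_true), u in zip(data, U)
            for p, t in zip(toy_segment(x, w, u, y_true), y_true)
            if p != t
        )
    return min(grid, key=errors)

def robot_user(x, y_true, u, y_hat, w):
    """Hypothetical strategy s: label the first unlabeled, wrong pixel."""
    u_next = list(u)
    for i, (t, p) in enumerate(zip(y_true, y_hat)):
        if u[i] == 0 and t != p:
            u_next[i] = 1
            break
    return u_next

def interleaved_learning(data, fit, segment, strategy, T):
    """Alternate between fitting w on the current strokes and letting the
    strategy add one stroke per image; returns w^T and the final strokes."""
    U = [[0] * len(y_true) for _, y_true in data]   # u^{k,0}: no interaction
    w = fit(data, U)                                # w^0
    for _ in range(T):
        for k, (x, y_true) in enumerate(data):
            y_hat = segment(x, w, U[k], y_true)
            U[k] = strategy(x, y_true, U[k], y_hat, w)   # u^{k,t+1}
        w = fit(data, U)                            # w^{t+1}
    return w, U
```

A call such as `interleaved_learning(data, toy_fit, toy_segment, robot_user, T=2)` returns the final parameter together with the strokes the simulated user placed along the way.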

5.4. Experiments 

We ran our optimisation algorithm with GCS on 5-fold 
CV train/test splits of the GrabCut images. We used unary 
potentials (GMM and flux) as well as two pairwise poten- 
tials (Ising and contrast) and the center robot user with 
$B = 25$ strokes. Fig. 5b shows how the relative weight
of the linear parameters varies over time. At the beginning,
smoothing (a high Ising weight) is needed, whereas later, edges
are most important (high $w_c$). The SVMstruct objective
also changes: Fig. 5c makes clear that the data fit term
decreases over time and regularisation increases. However,
looking at the test error in Fig. 5a (averaged over 5 folds),
we see only very little difference between the performance
of the initial parameter $\mathbf{w}^0$ and the final parameter $\mathbf{w}^T$. Our
explanation is based on the fact that GCS is too simple as 
it does not include connectivity or unary iterations. In addition
to the Gaussian Mixture Model (GMM) based color
potentials, we also experimented with flux potentials [12] as
a second unary term. Figure 5 shows one example where
we included a flux unary potential. We get almost identical
behavior without flux unaries.

6. Conclusion 

This paper showed how user interaction models (robot 
users) can be used to train and evaluate interactive systems. 
We demonstrated the power of this approach on the problem 
of parameter learning in interactive segmentation systems. 
We showed how simple grid search can be used to find good 
parameters for different segmentation systems under an ac- 
tive user interaction model. We also compared the perfor- 
mance of the static and dynamic user interaction models. 
With more parameters, grid search becomes infeasible,
which naturally leads to the max-margin framework.

We introduced an extension to SVMstruct, which allows 
it to incorporate user interaction models, and showed how 
to solve the corresponding optimisation problem. How- 
ever, crucial parts of state-of-the-art segmentation systems 
Figure 5: Max-margin stat/dyn: a) Segmentation performance
using GCS when parameters are either statically or
dynamically learnt, b) Evolution of w during the optimisation,
c) Evolution of the first two terms of o(w).

include (1) non-linear parameters, (2) higher-order potentials
(e.g. enforcing connectivity) and (3) iterative updates
of the unary potentials, ingredients that cannot be handled
directly inside the max-margin framework. In future work,
we will try to tackle these challenges to enable learning of
optimal interactive systems.
Figure 6: Learning with grid search (single parameter at a time), $f(er_b) = \mathrm{sigmoid}(er_b)$, a-f training and g-l testing
Figure 7: Learning with grid search (single parameter at a time), $f(er_b) = er_b$, a-f training and g-l testing



